competitive programming
Owlgorithm: Supporting Self-Regulated Learning in Competitive Programming through LLM-Driven Reflection
Nieto-Cardenas, Juliana, Kramer, Erin Joy, Kurto, Peter, Dickey, Ethan, Bejarano, Andres
We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system provided reflective prompts that adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing its potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.
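A minimal sketch of what such outcome-adaptive prompting could look like, assuming the OpenAI Python client; the prompt wording, the `reflective_questions` helper, and the `verdict` field are illustrative assumptions, not Owlgorithm's published implementation.

```python
# Hypothetical sketch of outcome-adaptive reflective prompting (not
# Owlgorithm's actual code): the system prompt changes with the verdict.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reflective_questions(problem: str, code: str, verdict: str) -> str:
    if verdict == "accepted":
        goal = ("Ask three questions that push the student toward deeper "
                "conceptual insight: complexity, alternative algorithms, "
                "and how the technique generalizes.")
    else:
        goal = ("Ask three structured debugging questions: which input "
                "categories might fail, which invariant the code assumes, "
                "and how to construct a minimal counterexample.")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a metacognitive tutor for competitive "
                        "programming. " + goal},
            {"role": "user",
             "content": f"Problem:\n{problem}\n\nStudent code:\n{code}"},
        ],
    )
    return response.choices[0].message.content
```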
CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
Deng, Han, Meng, Yuan, Tang, Shixiang, Ouyang, Wanli, Ma, Xinzhu
Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. In this paper, we propose a new problem, similar question retrieval, to tackle this issue. Due to the lack of both data and models, solving this problem is challenging. To this end, we introduce CPRet, a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks: two code-centric (i.e., Text-to-Code, Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate, Simplified-to-Full), built from a combination of automatically crawled problem-solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. In addition, we develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem-code alignment, and CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks. GitHub: https://github.com/coldchair/CPRet Online demo: https://www.cpret.online/
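The abstract does not spell out the Group-InfoNCE formulation; a plausible sketch, assuming a standard in-batch InfoNCE where other solutions to the same problem are masked out of the negatives rather than penalized:

```python
# Illustrative contrastive loss in the spirit of Group-InfoNCE (assumed
# form, not the paper's exact definition): pairs sharing a problem id
# are excluded from each other's negatives.
import torch
import torch.nn.functional as F

def grouped_info_nce(problem_emb, code_emb, problem_ids, temperature=0.05):
    # problem_emb, code_emb: (N, d) L2-normalized embeddings of paired items
    # problem_ids: (N,) id of the problem each pair belongs to
    logits = problem_emb @ code_emb.T / temperature        # (N, N) similarities
    same_group = problem_ids.unsqueeze(0) == problem_ids.unsqueeze(1)
    eye = torch.eye(len(problem_ids), dtype=torch.bool, device=logits.device)
    # Mask off-diagonal entries from the same problem: they are positives,
    # not negatives, so they should not be pushed away.
    logits = logits.masked_fill(same_group & ~eye, float("-inf"))
    targets = torch.arange(len(problem_ids), device=logits.device)
    return F.cross_entropy(logits, targets)
```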
AutoCode: LLMs as Problem Setters for Competitive Programming
Zhou, Shang, Zheng, Zihan, Liu, Kaiyuan, Shen, Zeyu, Cheng, Zerui, Chen, Zexing, He, Hansen, Yao, Jianzhu, Mao, Huanzhi, Mang, Qiuyang, Fu, Tianfu, Li, Beichen, Li, Dongruixuan, Chai, Wenhao, Liu, Zhuang, Korolova, Aleksandra, Henderson, Peter, Jaques, Natasha, Viswanath, Pramod, Xie, Saining, Shang, Jingbo
Writing competitive programming problems is exacting. Authors must: set constraints, input distributions, and edge cases that rule out shortcuts; target specific algorithms (e.g., max-flow, dynamic programming, data structures); and calibrate complexity beyond the reach of most competitors. We argue that this makes for an ideal test of general large language model capabilities and study whether they can do this reliably. We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments, a significant improvement over current state-of-the-art methods like HardTests, which achieve less than 81%. Furthermore, starting with a random seed problem, AutoCode can create novel variants with reference and brute-force solutions. By cross-verifying these generated solutions against test cases, we can further filter out malformed problems. Our system ensures high correctness, as verified by human experts. AutoCode successfully produces novel problems judged by Grandmaster-level (top 0.3%) competitive programmers to be of contest quality.
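The cross-verification step described above reduces, in its simplest form, to running both solutions on every test input and discarding the problem on any disagreement; a sketch under assumed interfaces (command lists and a plain stdin/stdout protocol):

```python
# Sketch of the cross-verification filter described above (assumed
# details, not AutoCode's actual code): a generated problem survives
# only if the reference and brute-force solutions agree on all tests.
import subprocess

def run(solution_cmd, test_input, timeout=10):
    result = subprocess.run(solution_cmd, input=test_input, text=True,
                            capture_output=True, timeout=timeout)
    return result.stdout.strip()

def cross_verify(reference_cmd, brute_force_cmd, test_inputs):
    """Return True if both solutions produce identical output on all tests."""
    for test_input in test_inputs:
        if run(reference_cmd, test_input) != run(brute_force_cmd, test_input):
            return False  # disagreement: problem or tests are malformed
    return True
```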
Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?
Hosain, Md Tanzib, Morol, Md Kishor
Among the hardest tasks for humans are those found in competitive programming, where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain for assessing language models (LMs), however, it has received little attention. This study presents the ICPC benchmark, which consists of 254 International Collegiate Programming Contest (ICPC) tasks. Each problem includes an official analysis, reference code, and sample, high-quality unit, and hidden tests. These resources let us develop and evaluate a variety of LM inference techniques for competitive programming. With zero-shot chain-of-thought prompting, we find that o1 achieves only a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judging with reflection and retrieval over episodic information, raises this to 42.2%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that with just a few specific instructions, o1 can solve 17 of the 18 problems that were previously unsolvable by any model or technique. Our quantitative findings and qualitative analysis offer a step toward LMs with grounded, imaginative, and algorithmic thinking. We open-source our code and data at https://github.com/kraritt/zolve.
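The multi-turn loop described above can be summarized as generate, test, reflect, retry; a sketch with placeholder interfaces (`llm.generate`, `llm.reflect`, `retrieve`, `t.passes` are assumptions, not the authors' API; their code is at the linked repository):

```python
# Generic sketch of a multi-turn self-refinement loop with episodic
# retrieval (placeholder interfaces, not the paper's implementation).
def solve_with_refinement(problem, unit_tests, llm, retrieve, max_turns=5):
    episodic_notes = []                            # reflections carried across turns
    for turn in range(max_turns):
        context = retrieve(problem, episodic_notes)  # episodic retrieval
        code = llm.generate(problem, context)
        failures = [t for t in unit_tests if not t.passes(code)]
        if not failures:
            return code                            # all unit tests pass
        # Self-judge: ask the model to explain the failures and plan a fix
        reflection = llm.reflect(problem, code, failures)
        episodic_notes.append(reflection)
    return None                                    # unsolved within the budget
```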
ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
Yang, Xinwei, Liu, Zhaofeng, Huang, Chen, Zhang, Jiashuai, Zhang, Tong, Zhang, Yifan, Lei, Wenqiang
While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and to facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint the strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION
LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming
Sung, Sicheol, Aditi, Kim, Dogyu, Han, Yo-Sub, Ko, Sang-Ki
Automated Test Case Generation (ATCG) is crucial for evaluating software reliability, particularly in competitive programming where robust algorithm assessments depend on diverse and accurate test cases. However, existing ATCG methods often fail to meet complex specifications or generate effective corner cases, limiting their utility. In this work, we introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications. Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases. Experiments on the CodeContests dataset demonstrate that CCFG-based test cases outperform baseline methods in identifying incorrect algorithms, achieving significant gains in validity and effectiveness. Our approach provides a scalable and reliable grammar-driven framework for enhancing automated competitive programming evaluations.
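The counter idea can be shown in miniature: a value produced earlier in the derivation fixes how many times a later nonterminal expands, which plain context-free grammars cannot express. A toy sketch (the paper's CCFG formalism is richer than this):

```python
# Toy illustration of a counter in a grammar: the value generated for
# <n> controls how many times <int> repeats, capturing specs like
# "the first line contains n; the next line contains n integers".
import random

def gen_int(lo, hi):
    return random.randint(lo, hi)

def gen_test_case():
    n = gen_int(1, 10)                                    # counter bound to <n>
    values = [str(gen_int(-100, 100)) for _ in range(n)]  # <int> repeated n times
    return f"{n}\n{' '.join(values)}\n"

print(gen_test_case())
```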
ProBench: Benchmarking Large Language Models in Competitive Programming
Yang, Lei, Jin, Renren, Shi, Ling, Peng, Jianxiang, Chen, Yue, Xiong, Deyi
With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are increasingly inadequate for assessing the capability of advanced LLMs in code reasoning. To bridge the gap in high-level code reasoning assessment, we propose ProBench to benchmark LLMs in competitive programming, drawing inspiration from the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems from the Codeforces, Luogu, and Nowcoder platforms from July to December 2024, obtaining real test results through online submissions to ensure the fairness and accuracy of the evaluation. We establish a unified problem attribute system, including difficulty grading and algorithm tagging. With the carefully collected and annotated data in ProBench, we systematically assess nine recent LLMs in competitive programming across multiple dimensions, including thought-chain analysis, error type diagnosis, and reasoning depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38, suggesting that models trained on specialized reasoning tasks significantly outperform general-purpose models (even those larger than the reasoning-oriented models) in programming. Further analysis reveals key areas for programming capability enhancement, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.
Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis
Dumitran, Adrian Marius, Badea, Adrian Catalin, Muscalu, Stefan-Gabriel
This study explores the performance of large language models (LLMs) in solving competitive programming problems from the Romanian Informatics Olympiad at the county level. Romania, a leading nation in computer science competitions, provides an ideal environment for evaluating LLM capabilities due to its rich history and stringent competition standards. We collected and analyzed a dataset comprising 304 challenges from 2002 to 2023, focusing on solutions that LLMs wrote in C++ and Python for these problems. Our primary goal is to understand why LLMs perform well or poorly on different tasks. We evaluated various models, including closed-source models like GPT-4 and open-weight models such as CodeLlama and RoMistral, using a standardized process involving multiple attempts and feedback rounds. The analysis revealed significant variations in LLM performance across different grades and problem types. Notably, GPT-4 showed strong performance, indicating its potential use as an educational tool for middle school students. We also observed differences in code quality and style across the various LLMs.
Competitive programming with AlphaCode
Creating solutions to unforeseen problems is second nature in human intelligence – a result of critical thinking informed by experience. The machine learning community has made tremendous progress in generating and understanding textual data, but advances in problem solving remain limited to relatively simple maths and programming problems, or else retrieving and copying existing solutions. As part of DeepMind's mission to solve intelligence, we created a system called AlphaCode that writes computer programs at a competitive level. AlphaCode achieved an estimated rank within the top 54% of participants in programming competitions by solving new problems that require a combination of critical thinking, logic, algorithms, coding, and natural language understanding. In our preprint, we detail AlphaCode, which uses transformer-based language models to generate code at an unprecedented scale, and then smartly filters to a small set of promising programs.
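The filter-then-select strategy mentioned above can be sketched as: execute each generated program against the problem's example tests, keep only those that pass, then group survivors by behavior so that one representative per behavior is submitted. A simplified sketch (the `run` executor, `probe_inputs`, and test attributes are assumed interfaces, not AlphaCode's code):

```python
# Simplified sketch of large-scale sample filtering: keep programs that
# pass the example tests, then deduplicate survivors by their behavior
# on extra probe inputs (assumed interfaces throughout).
from collections import defaultdict

def filter_and_cluster(programs, example_tests, probe_inputs, run):
    # run(program, test_input) -> output string; assumed sandboxed executor
    survivors = [p for p in programs
                 if all(run(p, t.input) == t.output for t in example_tests)]
    clusters = defaultdict(list)
    for p in survivors:
        # Programs with identical outputs on the probes behave alike
        signature = tuple(run(p, x) for x in probe_inputs)
        clusters[signature].append(p)
    return clusters  # submit one representative per large cluster
```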
Guide to Competitive Programming: Learning and Improving Algorithms Through Contests (Undergraduate Topics in Computer Science)
Laaksonen, Antti (ISBN 9783030393564)
Topics and features: introduces dynamic programming and other fundamental algorithm design techniques, and investigates a wide selection of graph algorithms; compatible with the IOI Syllabus, yet also covering more advanced topics, such as maximum flows, Nim theory, and suffix structures; surveys specialized algorithms for trees, and discusses the mathematical topics that are relevant in competitive programming; reviews the features of the C++ programming language, and describes how to create efficient algorithms that can quickly process large data sets; discusses sorting algorithms and binary search, and examines a selection of data structures of the C++ standard library; covers such advanced algorithm design topics as bit-parallelism and amortized analysis, and presents a focus on efficiently processing array range queries; describes a selection of more advanced topics, including square-root algorithms and dynamic programming optimization.
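One of the listed topics, efficiently processing array range queries, has a classic minimal form worth showing: prefix sums answer static range-sum queries in O(1) after O(n) preprocessing (an illustrative example, not taken from the book's text):

```python
# Static range-sum queries via prefix sums: O(n) build, O(1) per query.
def build_prefix(a):
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)  # prefix[i] = sum of a[0..i-1]
    return prefix

def range_sum(prefix, l, r):
    """Sum of a[l..r] inclusive, 0-indexed."""
    return prefix[r + 1] - prefix[l]

a = [3, 1, 4, 1, 5, 9, 2, 6]
p = build_prefix(a)
print(range_sum(p, 2, 5))  # 4 + 1 + 5 + 9 = 19
```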